A Succinct N-gram Language Model
Authors
Abstract
Efficient processing of tera-scale text data is an important research topic. This paper proposes lossless compression of N-gram language models based on LOUDS, a succinct data structure. LOUDS succinctly represents a trie with M nodes as a (2M + 1)-bit string. We compress it further for the N-gram language model structure. We also use variable-length coding and block-wise compression to compress the values associated with nodes. Experimental results on three large-scale N-gram compression tasks show a significant compression rate without any loss.
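As a concrete illustration of the LOUDS encoding the abstract refers to, the sketch below builds the bit string for a toy trie by breadth-first traversal; the function and variable names are this sketch's own, not the paper's implementation. Each node contributes one '1' per child plus a terminating '0', and with the conventional "10" super-root prefix a trie with M nodes takes exactly 2M + 1 bits.

```python
from collections import deque

def louds_bits(children, root=0):
    """Encode a trie as a LOUDS bit string via level-order traversal."""
    bits = ["10"]                            # conventional super-root prefix
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")   # one '1' per child, then '0'
        queue.extend(kids)
    return "".join(bits)

# A 5-node trie: node 0 has children 1 and 2; node 1 has children 3 and 4.
trie = {0: [1, 2], 1: [3, 4]}
encoding = louds_bits(trie)
print(encoding)        # "10110110000"
print(len(encoding))   # 11 == 2 * 5 + 1
```

Parent/child navigation over this bit string is then answered with rank/select queries, which is what lets the trie be traversed without decompressing it.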
منابع مشابه
Randomized Language Models via Perfect Hash Functions
We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters. The scheme can represent any standard n-gram model and is easily combined with existing model reduction techniques such as entropy-pruning. We demonstrate the space-savings of the scheme via machine tran...
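To make the fingerprinting idea concrete, here is a minimal sketch under stated assumptions: a real implementation uses a minimal perfect hash so every stored n-gram gets its own slot, whereas this toy stands in an ordinary hash with a generously sized table; the class and function names are hypothetical.

```python
import hashlib

def fingerprint(ngram, bits=16):
    """Short fingerprint of an n-gram; collisions between distinct
    n-grams are the source of the scheme's bounded false positives."""
    digest = hashlib.sha1(" ".join(ngram).encode()).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)

class FingerprintTable:
    """Stores (fingerprint, value) pairs instead of the n-grams themselves."""

    def __init__(self, size):
        self.slots = [None] * size

    def _slot(self, ngram):
        # Stand-in for a minimal perfect hash over the stored n-gram set.
        digest = hashlib.md5(" ".join(ngram).encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.slots)

    def add(self, ngram, logprob):
        self.slots[self._slot(ngram)] = (fingerprint(ngram), logprob)

    def lookup(self, ngram):
        entry = self.slots[self._slot(ngram)]
        if entry is not None and entry[0] == fingerprint(ngram):
            return entry[1]    # correct, or a ~2^-16 false positive
        return None

table = FingerprintTable(size=1 << 20)
table.add(("new", "york", "city"), -1.7)
print(table.lookup(("new", "york", "city")))   # -1.7
print(table.lookup(("new", "york", "jets")))   # None (with high probability)
```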
Succinct Data Structures for NLP-at-Scale
Succinct data structures involve the use of novel data structures, compression technologies, and other mechanisms to allow data to be stored in extremely small memory or disk footprints, while still allowing for efficient access to the underlying data. They have successfully been applied in areas such as Information Retrieval and Bioinformatics to create highly compressible in-memory search ind...
Class-based language model adaptation using mixtures of word-class weights
This paper describes the use of a weighted mixture of class-based n-gram language models to perform topic adaptation. By using a fixed class n-gram history and variable word-given-class probabilities we obtain large improvements in the performance of the class-based language model, giving it similar accuracy to a word n-gram model, and an associated small but statistically significant improvemen...
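A minimal sketch of the decomposition this abstract describes, with toy numbers and names invented for illustration: the class n-gram p(c | class history) is shared across topics, while topic-specific word-given-class distributions are combined with adapted mixture weights.

```python
word_class = {"stock": "NOUN", "goal": "NOUN"}

def p_class(c, class_history):
    """Shared class n-gram over a fixed class history (toy values)."""
    return {"NOUN": 0.4}.get(c, 0.1)

# Topic-specific p_j(w | c); two toy topics.
p_word_given_class = [
    lambda w, c: {"stock": 0.30, "goal": 0.01}.get(w, 1e-4),  # finance
    lambda w, c: {"stock": 0.01, "goal": 0.25}.get(w, 1e-4),  # sports
]
weights = [0.8, 0.2]   # adapted topic mixture weights

def mixture_prob(word, class_history):
    """p(w | h) = p(c | class history) * sum_j lambda_j * p_j(w | c)."""
    c = word_class[word]
    pwc = sum(lam * p(word, c) for lam, p in zip(weights, p_word_given_class))
    return p_class(c, class_history) * pwc

print(mixture_prob("stock", ("NOUN",)))   # 0.4 * (0.8*0.30 + 0.2*0.01) = 0.0968
```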
Bayesian Variable Order n-gram Language Model based on Pitman-Yor Processes
This paper proposes a variable order n-gram language model by extending a recently proposed model based on the hierarchical Pitman-Yor processes. Introducing a stochastic process on an infinite-depth suffix tree, we can infer the hidden n-gram context from which each word originated. Experiments on standard large corpora showed the validity and efficiency of the proposed model. Our architecture is ...
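For reference, a hedged sketch of the machinery such models extend: in the hierarchical Pitman-Yor language model, the predictive probability of word $w$ after context $u$ interpolates recursively with the shortened context $\pi(u)$ (the notation below is the standard one for this model family, not taken from this abstract):

$$
p(w \mid u) = \frac{c_{uw} - d_{|u|}\, t_{uw}}{\theta_{|u|} + c_{u\cdot}}
            + \frac{\theta_{|u|} + d_{|u|}\, t_{u\cdot}}{\theta_{|u|} + c_{u\cdot}}\; p(w \mid \pi(u)),
$$

where $c_{uw}$ and $t_{uw}$ are customer and table counts in the usual Chinese-restaurant representation, and $d_{|u|}$, $\theta_{|u|}$ are the discount and strength parameters shared by contexts of length $|u|$. The variable order extension places a stochastic process over how deep in the suffix tree each word's context reaches.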
Approximate N-Gram Markov Model for Natural Language Generation
This paper proposes an Approximate n-gram Markov Model for bag generation. Directed word association pairs with distances are used to approximate (n-1)-gram and n-gram training tables. This model has the parameters of a word association model and the merits of both the word association model and the Markov model. The training knowledge for bag generation can also be applied to lexical selection in machine trans...
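As an illustrative toy (the scores, window size, and names here are invented, not the paper's tables): directed word-association pairs with distances can score candidate orderings of a word bag, approximating what full n-gram tables would provide.

```python
from itertools import permutations

# Hypothetical directed association scores: assoc[(u, v)] says how strongly
# u predicts v occurring after it within a small distance window.
assoc = {("the", "cat"): 2.0, ("cat", "sat"): 1.5, ("the", "sat"): 0.2}

def order_score(order, window=2):
    """Sum directed association scores over pairs within `window`."""
    total = 0.0
    for i, u in enumerate(order):
        for v in order[i + 1:i + 1 + window]:
            total += assoc.get((u, v), 0.0)
    return total

bag = ["cat", "the", "sat"]
best = max(permutations(bag), key=order_score)
print(best)   # ('the', 'cat', 'sat')
```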
Journal:
Volume / Issue:
Pages: -
Publication date: 2009